Exploratory Data Analysis | Swire Coca-Cola Capstone Project

IS 6813-001, Spring 2025 | Group 3

Authors

Adam Bushman

Georgia Christodoulou

Tyler Swanson

Zac Mendenhall

Published

February 23, 2025


Business Problem Statement

Regional beverage bottler Swire Coca-Cola (SCCU) relies on two business models: 1) “Red Truck”, which features high-volume customers serviced personally by Swire, and 2) “White Truck” or “Alternate Route to Market”, which entails smaller customers serviced by a third-party distributor.

Swire’s current segmenting strategy has led to misallocation of resources, inflated expenses, and missed opportunities from clients with high-growth potential. Swire aims to better align their customers with the business proposition of these models by identifying the customer characteristics and ordering behaviors that best determine the right business model for the long-term relationship.

EDA Introduction

In this exploratory data analysis (EDA) notebook, we assess the quality of the provided data, detail the data cleaning processes undertaken, define our success metrics, and outline the key questions driving our analysis. By examining customer characteristics and ordering behaviors, we aim to realign Swire’s customer segmentation with the strategic propositions of their two distribution models, ultimately enabling a more effective long-term relationship strategy.

File Investigation

Libraries & Data

We begin by setting up our session with the necessary libraries and data provided by Swire. These will be referenced often throughout this document.

# Load libraries for EDA
library('tidyverse')  # Data wrangling & visualization
library('gt')         # Create professional tables
library('janitor')    # Clean column names & messy data
library('psych')      # Descriptive stats & psych research tools
library('stringr')    # String manipulation
library('lubridate')  # Handle dates & times
library('rmarkdown')  # Render R Markdown docs
library('dplyr')      # Data manipulation (filter, select, mutate, etc.)
library('skimr')      # Quick summary stats
library('tidyr')      # Reshape/tidy data
library('readxl')     # Read Excel files
library('ggplot2')    # Data visualization
library('readr')      # Read/write CSV & text files
library('knitr')      # Generate dynamic reports
library('leaflet')    # Generate dynamic map

setwd("/home/adam-bushman/All Projects/delivery-standardization-group")


# Read files into session
transactions <- as.data.frame(data.table::fread("data/original/transactional_data.csv"))
customer_address <- read.csv("data/original/customer_address_and_zip_mapping.csv")
customer_profile <- read.csv("data/original/customer_profile.csv")
delivery_cost <- readxl::read_xlsx('data/original/delivery_cost_data.xlsx')

Files from Swire

Swire Coca-Cola provided us with four (4) files we can use for the project. These include:

Name Description Format Columns Rows
transactional_data This dataset records detailed transactional information, including order quantities and delivery metrics. .csv 11 1,045,540
customer_address_and_zip_mapping This dataset maps ZIP codes to full address information. .csv 2 1,801
customer_profile This dataset provides detailed information about customers, including onboarding and purchasing behavior. .csv 11 30,478
delivery_cost_data This dataset describes the median 'per gallon/case' cost of delivery according to annual order volume and customer profile. .xlsx 5 160

Let’s take a peek at each of these in turn. We’ll remark on:

  1. Completeness (how many data are missing)
  2. Data types (are fields properly formatted for analysis)
  3. Distribution (how values are spread out)
  4. Potential wrangling (how we can improve data set quality)

transactional_data.csv

skim(transactions)
Data summary
Name transactions
Number of rows 1045540
Number of columns 11
_______________________
Column type frequency:
character 2
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
TRANSACTION_DATE 0 1 8 10 0 723 0
ORDER_TYPE 0 1 3 13 0 7 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
WEEK 0 1 26.23 14.52 1.0 14 26 38.00 52.00 ▇▇▇▇▇
YEAR 0 1 2023.50 0.50 2023.0 2023 2023 2024.00 2024.00 ▇▁▁▁▇
CUSTOMER_NUMBER 0 1 546643776.32 49426585.56 500245678.0 501091920 501548213 600080939.00 600975408.00 ▇▁▁▁▇
ORDERED_CASES 0 1 26.85 126.76 0.0 0 7 18.50 8479.89 ▇▁▁▁▁
LOADED_CASES 0 1 25.92 122.79 0.0 0 7 18.00 8171.56 ▇▁▁▁▁
DELIVERED_CASES 0 1 25.13 121.52 -3132.0 0 6 17.33 8069.48 ▁▇▁▁▁
ORDERED_GALLONS 0 1 9.87 26.47 0.0 0 0 12.50 2562.50 ▇▁▁▁▁
LOADED_GALLONS 0 1 9.60 25.65 0.0 0 0 12.50 2562.50 ▇▁▁▁▁
DELIVERED_GALLONS 0 1 9.21 25.18 -1792.5 0 0 12.50 2292.50 ▁▁▇▁▁
glimpse(transactions)
Rows: 1,045,540
Columns: 11
$ TRANSACTION_DATE  <chr> "1/5/2023", "1/6/2023", "1/9/2023", "1/11/2023", "1/…
$ WEEK              <int> 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7, 7, 9…
$ YEAR              <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
$ CUSTOMER_NUMBER   <int> 501202893, 500264574, 501174701, 600586532, 50101432…
$ ORDER_TYPE        <chr> "MYCOKE LEGACY", "MYCOKE LEGACY", "MYCOKE LEGACY", "…
$ ORDERED_CASES     <dbl> 1.0, 12.5, 2.0, 18.0, 29.0, 1.5, 6.0, 0.0, 4.0, 18.0…
$ LOADED_CASES      <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0, 17.0…
$ DELIVERED_CASES   <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0, 17.0…
$ ORDERED_GALLONS   <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
$ LOADED_GALLONS    <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…
$ DELIVERED_GALLONS <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000000, 0…

Completeness

The data are fully complete; there’s not a single column with missing or empty values.

Data Types

We have a fair mix of data types, though some are improperly cast. Identifiers like CUSTOMER_NUMBER shouldn’t be continuous, and dates like TRANSACTION_DATE can be enriched with a date/datetime format.

Distribution

The data are quite skewed: nearly all of the volume columns are heavily right-skewed (meaning most transactions are small). We also see some “negative” delivery volumes, which we understand to be returns.

Potential Wrangling

This data set will benefit most from proper data types. Additionally, we can determine, for each customer, how often their transactions involve returns (a potentially helpful indicator of client quality).
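As a minimal sketch of that return indicator (using a toy data frame, not the actual transaction file), we could flag any transaction with a negative delivered volume and compute a per-customer return rate:

```r
# Minimal sketch on toy data: flag returns (negative delivered volumes)
# and compute a per-customer return rate
library(dplyr)

toy <- data.frame(
  CUSTOMER_NUMBER   = c(1, 1, 1, 2, 2),
  DELIVERED_CASES   = c(10, -2, 5, 8, 7),
  DELIVERED_GALLONS = c(0, 0, -1, 0, 0)
)

return_rates <- toy |>
  group_by(CUSTOMER_NUMBER) |>
  summarise(
    n_transactions = n(),
    # A transaction counts as a return if either volume is negative
    n_returns      = sum(DELIVERED_CASES < 0 | DELIVERED_GALLONS < 0),
    return_rate    = n_returns / n_transactions,
    .groups = "drop"
  )

return_rates
```

On the toy data, customer 1 returns on 2 of 3 transactions while customer 2 never returns; the same grouping logic applies unchanged to the full transactional data.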

We don’t yet have a good sense for 1) the breadth of data longitudinally and 2) what order types exist. Let’s solve these with some basic visualizations.

# Prepare and plot the daily transaction volume over time
transactions |>
  
  # Convert the TRANSACTION_DATE column from character to Date type
  mutate(
    TRANSACTION_DATE = lubridate::mdy(TRANSACTION_DATE)  # mdy() parses dates in month-day-year format
  ) |>
  
  # Group by transaction date and count number of transactions per day
  group_by(TRANSACTION_DATE) |>
  count() |>
  ungroup() |>  # Remove grouping to simplify plotting
  
  # Create a time series line plot of daily transaction counts
  ggplot(
      aes(TRANSACTION_DATE, n)  # Map x to date, y to daily count
    ) +
    geom_line(color = swire_colors$blue) +  # Line plot of transaction volume
    geom_smooth(color = swire_colors$red) +  # Smoothed trend line for easier interpretation
    scale_x_date() +  # Format the x-axis for dates
    labs(
      title = "Daily Transaction Volume",  # Title of the plot
      x = "Transaction Date"  # Label for the x-axis
    ) +
    theme_swire() +  # Apply custom ggplot theme
    theme(
      axis.title.y = element_blank()  # Remove y-axis title for cleaner look
    )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

There’s obviously high variability in the number of transactions per day, but the smoothed trend shows a typical daily volume between 1,000 and 2,000. Additionally, it’s clear we have two (2) years of transactions: 2023 & 2024.

# Summarize the number and percentage of transactions by ORDER_TYPE
order_types <- 
  transactions |>
  group_by(ORDER_TYPE) |>
  count() |>
  ungroup() |>
  mutate(
    perc = n / sum(n)  # Calculate percent share for each order type
  ) |>
  arrange(n)  # Order from smallest to largest count

# Reorder factor levels to match ascending order of counts for plotting
order_types$ORDER_TYPE = factor(order_types$ORDER_TYPE, levels = order_types$ORDER_TYPE)

# Create a horizontal bar chart showing the distribution of ORDER_TYPE
ggplot(
    order_types, 
    aes(n, ORDER_TYPE, label = scales::percent_format()(perc))  # Format labels as percentages
  ) +
  geom_col(fill = swire_colors$red) +  # Red bars for each order type
  geom_text(
    aes(
      hjust = ifelse(n < 1e5, -0.1, 1.1),  # Adjust horizontal text position based on bar length
      color = ifelse(n < 1e5, "black", "white")  # Use black text for short bars, white for long
    )
  ) +
  scale_color_identity() +  # Use the raw color values from the data without scaling
  labs(
    title = "Distribution of `ORDER_TYPE`"  # Plot title
  ) +
  theme_swire() +  # Apply custom ggplot theme
  theme(
    axis.title = element_blank(),  # Remove axis titles
    axis.text.x = element_blank(),  # Hide x-axis text
    axis.ticks.x = element_blank()  # Hide x-axis ticks
  )

Okay, we now see there are six unique order types plus some values recorded as the literal text “null”, which is an opportunity to clean up. Three order types are clearly of largest interest:

  • MYCOKE LEGACY
  • CALL CENTER
  • SALES REP
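A small illustration of that cleanup (on hypothetical values, assuming the literal string “null” is the only sentinel in the column):

```r
# Sketch: convert the literal text "null" into a proper missing value
library(dplyr)

order_type <- c("SALES REP", "null", "CALL CENTER", "MYCOKE LEGACY")
cleaned    <- na_if(order_type, "null")  # "null" becomes NA

cleaned
```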

customer_address_and_zip_mapping.csv

skim(customer_address)
Data summary
Name customer_address
Number of rows 1801
Number of columns 2
_______________________
Column type frequency:
character 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
full.address 0 1 45 73 0 1801 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
zip 0 1 28919.81 25588.64 1001 2153 21634 42440 71483 ▇▃▅▁▅
glimpse(customer_address)
Rows: 1,801
Columns: 2
$ zip          <int> 71018, 71021, 71023, 71024, 71039, 71055, 71058, 71071, 7…
$ full.address <chr> "71018,Cotton Valley,Louisiana,LA,Webster,119,32.819,-93.…

Completeness

We aren’t seeing any incomplete/missing values so far here. That’s great news.

Data Types

The data types aren’t quite right. Because ZIP codes can carry leading zeros (and nine-digit forms exist), zip should really be of type “character”. Also, full.address is a single comma-separated string; not ideal.
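For instance, casting to character and restoring the leading zeros with str_pad() (a sketch on sample values):

```r
# Sketch: numeric ZIPs lose their leading zeros; pad them back as characters
library(stringr)

zips   <- c(1001, 2153, 71483)
padded <- str_pad(zips, width = 5, side = "left", pad = "0")

padded
#> [1] "01001" "02153" "71483"
```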

Distribution

There’s no field where assessing distribution is helpful/relevant.

Potential Wrangling

Pulling out the values from full.address will be helpful. We can then make sure those are of proper data types.

One thing we don’t have a great sense for right now is where the addresses are located. Let’s see if we can’t make a dynamic map showing where customers are located:

Swire Market Geography

We’ll first pull out the latitude/longitude values.

# Split the `full.address` column into multiple geographic components
addr_exp <- 
  customer_address |>
  separate(
    full.address, 
    into = c("zip", "city", "state", "state_abbr", "county", "region", "lat", "lon"),  # New columns to create
    sep = ",",  # Split on commas
    convert = TRUE  # Automatically convert resulting columns to appropriate types (e.g., numeric for lat/lon)
  )

Then we’ll add them to a map:

# Create an interactive leaflet map showing customer address locations
swire_map <-                                                              
    leaflet(height = 800, width = 800) |>  # Initialize leaflet map with set dimensions
    addTiles() |>  # Add default OpenStreetMap tile layer as a base
    setView(
        lng = mean(addr_exp$lon),  # Center map horizontally at average longitude
        lat = mean(addr_exp$lat),  # Center map vertically at average latitude
        zoom = 5  # Set initial zoom level for a broad regional view
    ) |>  
    addProviderTiles("CartoDB.Positron") |>  # Add light-themed basemap from CartoDB
    addCircleMarkers(
        data = addr_exp,  # Use the expanded address dataset
        lng = ~lon,  # Longitude from data
        lat = ~lat,  # Latitude from data
        radius = 6,  # Marker size
        color = swire_colors$red,  # Marker color using custom palette
        fillOpacity = 0.75  # Transparency for better visibility when overlapping
    )

# Render the map
swire_map

Okay, we’re seeing customers largely in Kansas, Kentucky, Maryland, and Massachusetts, with some in Louisiana. It’s not yet clear to what extent, if any, this geography will hold value in a more optimized delivery model solution for Swire, but we’ll ensure the data are wrangled properly regardless.

customer_profile.csv

skim(customer_profile)
Data summary
Name customer_profile
Number of rows 30478
Number of columns 11
_______________________
Column type frequency:
character 6
logical 2
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
FREQUENT_ORDER_TYPE 0 1 3 13 0 6 0
FIRST_DELIVERY_DATE 0 1 8 10 0 2401 0
ON_BOARDING_DATE 0 1 8 10 0 6487 0
COLD_DRINK_CHANNEL 0 1 5 13 0 9 0
TRADE_CHANNEL 0 1 6 28 0 26 0
SUB_TRADE_CHANNEL 0 1 4 27 0 48 0

Variable type: logical

skim_variable n_missing complete_rate mean count
LOCAL_MARKET_PARTNER 0 1 0.90 TRU: 27355, FAL: 3123
CO2_CUSTOMER 0 1 0.39 FAL: 18496, TRU: 11982

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
CUSTOMER_NUMBER 0 1.0 538301800.92 47950644.47 500245678 501164306 501573994 600075795 600975408 ▇▁▁▁▅
PRIMARY_GROUP_NUMBER 18196 0.4 2779.85 2608.64 4 444 1892 4488 9999 ▇▃▂▁▁
ZIP_CODE 0 1.0 30252.25 25953.08 1001 2155 21771 42762 71483 ▇▃▅▁▆
glimpse(customer_profile)
Rows: 30,478
Columns: 11
$ CUSTOMER_NUMBER      <int> 501556470, 501363456, 600075150, 500823056, 60008…
$ PRIMARY_GROUP_NUMBER <int> 376, NA, 2158, 2183, 1892, NA, 9996, NA, NA, 8803…
$ FREQUENT_ORDER_TYPE  <chr> "MYCOKE LEGACY", "SALES REP", "SALES REP", "OTHER…
$ FIRST_DELIVERY_DATE  <chr> "1/2/2024", "4/14/2022", "3/4/2016", "2/6/2019", …
$ ON_BOARDING_DATE     <chr> "8/28/2023", "3/22/2022", "3/22/2012", "11/23/201…
$ COLD_DRINK_CHANNEL   <chr> "DINING", "DINING", "DINING", "DINING", "PUBLIC S…
$ TRADE_CHANNEL        <chr> "FAST CASUAL DINING", "COMPREHENSIVE DINING", "FA…
$ SUB_TRADE_CHANNEL    <chr> "PIZZA FAST FOOD", "FSR - MISC", "OTHER FAST FOOD…
$ LOCAL_MARKET_PARTNER <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, …
$ CO2_CUSTOMER         <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRU…
$ ZIP_CODE             <int> 21664, 1885, 67073, 1885, 1203, 67952, 40517, 156…

Completeness

All columns appear complete with the exception of PRIMARY_GROUP_NUMBER; approximately 60% of records are missing values here. This is interpreted as customers who are “stand alone”, as in not part of a corporate franchising arm. Therefore, it’s understandable these values are missing.

Data Types

Data types are largely appropriate. We have several “character” types we can turn into factors. Some dates should be appropriately cast. We do need to change identifiers to “character”. And similar processing as already described will be done for ZIP_CODE.

Distribution

There’s not a ton to note here except that the vast majority of customers are “local market partners” and most aren’t purchasing CO2 from Swire, either because they source it elsewhere or are not fountain customers.

Potential Wrangling

It’s likely that a customer belonging to a franchise would be evaluated differently than a stand-alone customer, all else being equal. We can derive the number of customers belonging to each primary group as a measure of franchise size.
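A minimal sketch of that derivation on toy profiles (stand-alone customers, with missing PRIMARY_GROUP_NUMBER, get a group size of zero):

```r
# Sketch: size of each primary group, with stand-alone customers set to 0
library(dplyr)

profiles <- data.frame(
  CUSTOMER_NUMBER      = 1:6,
  PRIMARY_GROUP_NUMBER = c(100, 100, 100, 200, NA, NA)
)

profiles_sized <- profiles |>
  group_by(PRIMARY_GROUP_NUMBER) |>
  mutate(
    # Count distinct customers per group; NA group means stand-alone
    primary_group_customers = ifelse(
      is.na(PRIMARY_GROUP_NUMBER), 0L, n_distinct(CUSTOMER_NUMBER)
    )
  ) |>
  ungroup()

profiles_sized
```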

Let’s see what values/frequency is inherent to COLD_DRINK_CHANNEL:

# Summarize the number and percentage of customers by COLD_DRINK_CHANNEL
channel <- 
  customer_profile |>
  group_by(COLD_DRINK_CHANNEL) |>
  count() |>
  ungroup() |>
  mutate(
    perc = n / sum(n)  # Calculate percent share for each channel
  ) |>
  arrange(n)  # Order from smallest to largest count

# Reorder factor levels so bars display in ascending order
channel$COLD_DRINK_CHANNEL = factor(channel$COLD_DRINK_CHANNEL, levels = channel$COLD_DRINK_CHANNEL)

# Create a horizontal bar chart showing distribution of COLD_DRINK_CHANNEL
ggplot(
    channel, 
    aes(n, COLD_DRINK_CHANNEL, label = scales::percent_format()(perc))  # Show percentage labels
  ) +
  geom_col(fill = swire_colors$red) +  # Red bars for each channel
  geom_text(
    aes(
      hjust = ifelse(n < 10000, -0.1, 1.1),  # Position text inside or outside based on count
      color = ifelse(n < 10000, "black", "white")  # Ensure text is visible on the background
    )
  ) +
  scale_color_identity() +  # Use exact color values with no scale transformation
  labs(
    title = "Distribution of `COLD_DRINK_CHANNEL`"  # Plot title
  ) +
  theme_swire() +  # Apply custom ggplot theme
  theme(
    axis.title = element_blank(),  # Remove axis titles
    axis.text.x = element_blank(),  # Hide x-axis text
    axis.ticks.x = element_blank()  # Hide x-axis ticks
  )

There are 9 different channels, largely dominated by “DINING”. Others of interest include “EVENT” and “GOODS”.

delivery_cost_data.xlsx

skim(delivery_cost)
Data summary
Name delivery_cost
Number of rows 160
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Cold Drink Channel 0 1 5 13 0 8 0
Vol Range 0 1 5 11 0 10 0
Applicable To 0 1 8 16 0 2 0
Cost Type 0 1 8 10 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Median Delivery Cost 0 1 2.6 1.71 0.37 1.33 2.24 3.47 8.59 ▇▆▂▁▁
glimpse(delivery_cost)
Rows: 160
Columns: 5
$ `Cold Drink Channel`   <chr> "WORKPLACE", "WORKPLACE", "WORKPLACE", "WORKPLA…
$ `Vol Range`            <chr> "0 - 149", "150 - 299", "300 - 449", "450 - 599…
$ `Applicable To`        <chr> "Bottles and Cans", "Bottles and Cans", "Bottle…
$ `Median Delivery Cost` <dbl> 8.0649504, 4.1656458, 2.9915579, 2.5242219, 2.0…
$ `Cost Type`            <chr> "Per Case", "Per Case", "Per Case", "Per Case",…

Completeness

All values are present; there shouldn’t be any imputation necessary.

Data Types

Vol Range is a field that needs to be split out into a “min”/“max”. Everything else seems very reasonable.

Distribution

In most cases, the distribution of Median Delivery Cost is skewed right, which is indicative of outlier costs for troublesome clients.

Potential Wrangling

It’s notable that the column names are formatted differently. This is an area of standardization. Additionally, we’ll need to do quite a bit of prep to get the appropriate annual volumes by customer and intersect with this lookup table.

# Ensure the factor order of Cold Drink Channel matches that used in the earlier chart
delivery_cost$`Cold Drink Channel` = factor(
  delivery_cost$`Cold Drink Channel`, 
  levels = channel$COLD_DRINK_CHANNEL  # Use levels from previous summary to ensure consistent ordering
)

# Create faceted boxplots showing delivery cost distribution by channel and product type
ggplot(
  delivery_cost, 
  aes(`Median Delivery Cost`, `Cold Drink Channel`)  # Map cost to x-axis, channel to y-axis
) +
  geom_boxplot(color = swire_colors$red) +  # Red boxplots for visual emphasis
  scale_x_continuous(labels = scales::label_currency()) +  # Format x-axis as currency
  facet_wrap(~`Applicable To`, nrow = 1, scales = "free") +  # Create separate plots for each product type
  labs(
    title = "Distribution of Median Delivery Cost",  # Plot title
    subtitle = "By `Cold Drink Channel` and `Product Type`"  # Clarify the facets
  ) +
  theme_swire() +  # Apply custom ggplot theme
  theme(
    axis.title = element_blank()  # Remove axis titles for cleaner appearance
  )

Interestingly, there’s some real variance among costs in some Cold Drink Channels but less in others. Additionally, it’s not consistent between “Bottles and Cans” and “Fountain” products.

Data Wrangling

Having explored the files individually and made note of important details, we can now move on to wrangling these data into a single object suitable for the more in-depth analysis upcoming. This wrangling will comprise:

  • Prepping individual files for the right level of detail
  • Combining the four (4) individual files we collected
  • Standardizing column names
  • Developing a theory for ordering said columns
  • Casting fields to their proper data types
  • Deriving columns that further describe customer profiles
  • Settling on transaction-level data with appropriate cost estimates
  • Etc.

Prepping the Cost Data

delivery_cost_expanded <- 
    delivery_cost |>
    # Split the volume range into an object
    mutate(
        range_obj = purrr::map(`Vol Range`, str_split, " - ")
    ) |>
    # Unnest the object for individual reference
    unnest(range_obj) |>
    unnest_wider(range_obj, names_sep = "_") |>
    # Handle the "1350+" scenario
    mutate(
        min_vol = purrr::map_chr(range_obj_1, str_replace, "\\+", ""), 
        max_vol  = ifelse(is.na(range_obj_2), (2^31) - 1, range_obj_2)
    ) |>
    # Turn volumes from characters to integers
    mutate(
        across(min_vol:max_vol, as.integer)
    ) |>
    # Drop irrelevant columns
    select(-c(range_obj_1, range_obj_2, `Vol Range`))
annual_cust_volume <-
    # Take transaction level data
    transactions |>
    # Bring in the customer profile for the `Cold Drink Channel`
    inner_join(
        customer_profile, 
        join_by(CUSTOMER_NUMBER)
    ) |>
    # Get annual cases/gallons by customer
    group_by(YEAR, CUSTOMER_NUMBER, COLD_DRINK_CHANNEL) |>
    summarise(
        annual_cases = sum(DELIVERED_CASES), 
        annual_gallons = sum(DELIVERED_GALLONS), 
        .groups = "drop"
    )
# Join delivery cost estimates to each customer's annual volume to assign appropriate tiered rates
delivery_cost_tiers <-
    annual_cust_volume |>

    # Join delivery costs for *case-based* products (excluding Fountain)
    left_join(
        delivery_cost_expanded |> 
            filter(`Applicable To` != 'Fountain'),  # Filter to non-fountain products
        join_by(
            COLD_DRINK_CHANNEL == `Cold Drink Channel`,  # Match by drink channel
            annual_cases >= min_vol,  # Ensure volume falls within the defined min
            annual_cases <= max_vol   # Ensure volume falls within the defined max
        )
    ) |>

    # Join delivery costs for *gallon-based* products (only Fountain)
    left_join(
        delivery_cost_expanded |> 
            filter(`Applicable To` == 'Fountain'),  # Filter only fountain records
        join_by(
            COLD_DRINK_CHANNEL == `Cold Drink Channel`,  # Match by drink channel
            annual_gallons >= min_vol,  # Match gallon volume to range
            annual_gallons <= max_vol
        ),
        suffix = c(".c", ".g")  # Add suffixes to distinguish overlapping column names
    ) |>

    # Select only the relevant fields for output
    select(
        YEAR, CUSTOMER_NUMBER, 
        case_delivery_cost = `Median Delivery Cost.c`,  # Cost per case delivery
        gallon_delivery_cost = `Median Delivery Cost.g`  # Cost per gallon delivery
    )

# Take a peek
head(delivery_cost_tiers)
# A tibble: 6 × 4
   YEAR CUSTOMER_NUMBER case_delivery_cost gallon_delivery_cost
  <int>           <int>              <dbl>                <dbl>
1  2023       500245678               4.47                 2.68
2  2023       500245685               8.59                 2.01
3  2023       500245686               7.33                 4.62
4  2023       500245687               5.59                 3.76
5  2023       500245689               8.59                 2.53
6  2023       500245690               8.59                 3.02

Prep the Customer Address Object

cust_addr_expanded <-
    customer_address |>
    # Split the full address into an object
    mutate(
        addr_obj = purrr::map(full.address, str_split, ",")
    ) |>
    # Unnest the object for individual reference
    unnest(addr_obj) |>
    unnest_wider(addr_obj, names_sep = "_") |>
    # Pad the zip code with leading zeros and make a character
    mutate(
        zip = str_pad(zip, 5, "left", pad = "0")
    ) |>
    # Rename columns
    rename(
        city = addr_obj_2, 
        state = addr_obj_3, 
        state_abbr = addr_obj_4, 
        county = addr_obj_5, 
        lat = addr_obj_7, 
        lon = addr_obj_8
    ) |>
    # Turn lat/lon values to numeric
    mutate(
        across(lat:lon, as.numeric)
    ) |>
    # Drop irrelevant columns
    select(-c(full.address, addr_obj_1, addr_obj_6))

Combine Individual Files

combined_data_raw <-
    # Take transactions
    transactions |>
    # Join the customer profile data thereto
    inner_join(
        customer_profile |> mutate(zip = str_pad(
            ZIP_CODE, 5, "left", "0"
        )), 
        join_by(CUSTOMER_NUMBER)
    ) |>
    # Join the customer address data thereto
    inner_join(
        cust_addr_expanded, 
        join_by(zip)
    ) |>
    # Join the delivery cost tiers data thereto
    inner_join(
        delivery_cost_tiers, 
        join_by(YEAR, CUSTOMER_NUMBER)
    )

Standardize Names & Data Types

combined_data_std <- 
    # Take the combined data from above
    combined_data_raw |>
    # Standardize the names
    clean_names() |>
    # Standardize data types
    mutate(
        # Convert character dates to date types
        across(c(transaction_date, first_delivery_date, on_boarding_date), lubridate::mdy), 
        # Turn IDs into characters
        across(c(customer_number, primary_group_number), as.character), 
        # Turn finite categorical fields into factors
        across(
            c(order_type, cold_drink_channel, frequent_order_type, trade_channel, sub_trade_channel, state, state_abbr), 
            as.factor
        )
    ) |>
    # Remove irrelevant columns
    select(-c(zip_code))

Enrich Dataset with New Fields

swire_data_full <-
    combined_data_std |>
    # Add new fields
    mutate(
        # Calculate delivered gallons cost
        # Assume a return is only half as costly as a normal delivery
        delivered_gallons_cost = case_when(
            delivered_gallons < 0 ~ -1 * delivered_gallons * gallon_delivery_cost / 2, 
            TRUE ~ delivered_gallons * gallon_delivery_cost
        ), 
        # Calculate delivered case cost
        # Assume a return is only half as costly as a normal delivery
        delivered_cases_cost = case_when(
            delivered_cases < 0 ~ -1 * delivered_cases * case_delivery_cost / 2, 
            TRUE ~ delivered_cases * case_delivery_cost
        ),
        # Create 'total' columns representing the sum of gallons & cases
        ordered_total = ordered_gallons + ordered_cases, 
        loaded_total = loaded_gallons + loaded_cases, 
        delivered_total = delivered_gallons + delivered_cases
    ) |>
    group_by(year, primary_group_number) |>
    mutate(
        # Calculate number of customers belonging to each primary group by year
        primary_group_customers = ifelse(is.na(primary_group_number), 0, n_distinct(customer_number))
    ) |>
    group_by(year, customer_number) |>
    mutate(
        # Calculate how often a customer issues a return each year
        return_frequency = sum(ifelse(delivered_cases < 0 | delivered_gallons < 0, 1, 0))
    ) |>
    ungroup() |>
    # Drop select columns that are no longer relevant
    select(-c(gallon_delivery_cost, case_delivery_cost)) |>
    # Order the columns logically
    select(
        # CUSTOMER PROFILE ITEMS
        customer_number, primary_group_number, primary_group_customers, 
        on_boarding_date, first_delivery_date, cold_drink_channel, frequent_order_type, trade_channel, sub_trade_channel, local_market_partner, co2_customer, city, zip, state, state_abbr, county, lat, lon, 
        
        # TRANSACTION DETAILS
        transaction_date, week, year, order_type, 
        ordered_cases, loaded_cases, delivered_cases, delivered_cases_cost, 
        ordered_gallons, loaded_gallons, delivered_gallons, delivered_gallons_cost, 
        ordered_total, loaded_total, delivered_total, 
        return_frequency
    )

Final Data Set

glimpse(swire_data_full)
Rows: 1,045,540
Columns: 34
$ customer_number         <chr> "501202893", "500264574", "501174701", "600586…
$ primary_group_number    <chr> NA, "1894", NA, "8397", "1993", "437", NA, NA,…
$ primary_group_customers <dbl> 0, 47, 0, 49, 78, 651, 0, 0, 0, 0, 28, 0, 0, 8…
$ on_boarding_date        <date> 2021-04-02, 2015-12-08, 2021-01-26, 1997-02-2…
$ first_delivery_date     <date> 2021-05-07, 2018-03-23, 2021-04-12, 2017-05-0…
$ cold_drink_channel      <fct> DINING, WELLNESS, DINING, BULK TRADE, GOODS, G…
$ frequent_order_type     <fct> SALES REP, OTHER, MYCOKE360, SALES REP, SALES …
$ trade_channel           <fct> COMPREHENSIVE DINING, HEALTHCARE, FAST CASUAL …
$ sub_trade_channel       <fct> FSR - MISC, OTHER HEALTHCARE, OTHER FAST FOOD,…
$ local_market_partner    <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRU…
$ co2_customer            <lgl> TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…
$ city                    <chr> "Mahaska", "Banner", "Chelmsford", "Bernardsto…
$ zip                     <chr> "66955", "41603", "01824", "01337", "67473", "…
$ state                   <fct> Kansas, Kentucky, Massachusetts, Massachusetts…
$ state_abbr              <fct> KS, KY, MA, MA, KS, KS, MD, MA, KY, KS, MD, KY…
$ county                  <chr> "Washington", "Floyd", "Middlesex", "Franklin"…
$ lat                     <dbl> 39.9845, 37.5707, 42.5911, 42.6952, 39.4194, 3…
$ lon                     <dbl> -97.3453, -82.6806, -71.3556, -72.5772, -98.69…
$ transaction_date        <date> 2023-01-05, 2023-01-06, 2023-01-09, 2023-01-1…
$ week                    <int> 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 7…
$ year                    <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023…
$ order_type              <fct> MYCOKE LEGACY, MYCOKE LEGACY, MYCOKE LEGACY, S…
$ ordered_cases           <dbl> 1.0, 12.5, 2.0, 18.0, 29.0, 1.5, 6.0, 0.0, 4.0…
$ loaded_cases            <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0…
$ delivered_cases         <dbl> 1.0, 12.5, 2.0, 16.0, 29.0, 1.5, 5.0, 0.0, 1.0…
$ delivered_cases_cost    <dbl> 8.585482, 56.436259, 17.170965, 49.293328, 134…
$ ordered_gallons         <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ loaded_gallons          <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ delivered_gallons       <dbl> 90.000000, 0.000000, 0.000000, 2.500000, 0.000…
$ delivered_gallons_cost  <dbl> 129.7426990, 0.0000000, 0.0000000, 7.1199352, …
$ ordered_total           <dbl> 91.000000, 12.500000, 2.000000, 20.500000, 29.…
$ loaded_total            <dbl> 91.000000, 12.500000, 2.000000, 18.500000, 29.…
$ delivered_total         <dbl> 91.000000, 12.500000, 2.000000, 18.500000, 29.…
$ return_frequency        <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…

Now, we have a single, standardized data set that is enriched, properly formatted, and well-suited to the remaining analysis.

Data Exploration

Fulfillment and Order Frequency

In the table below, we calculate the fulfillment rate for various customers. The fulfillment rate is defined as total delivered cases divided by total ordered cases.

  • A fulfillment rate of 1 means the customer receives exactly what they order.
  • A fulfillment rate greater than 1 indicates they are delivered more than they ordered.
  • A fulfillment rate less than 1 indicates some ordered cases went undelivered.

It may be worth investigating whether the customers above 1 have high return rates or consistently receive more than they request. If it’s the latter, this pattern could be an indicator of high sales demand.

# Calculate fulfillment rates by customer
fulfillment_rates <- swire_data_full %>%
  group_by(customer_number) %>%
  summarise(
    total_ordered_cases = sum(ordered_cases, na.rm = TRUE),
    total_delivered_cases = sum(delivered_cases, na.rm = TRUE),
    fulfillment_rate = ifelse(total_ordered_cases == 0, NA, total_delivered_cases / total_ordered_cases)
  ) %>%
  filter(!is.na(fulfillment_rate)) %>%  # Remove rows where fulfillment rate is undefined
  arrange(desc(fulfillment_rate))



# Define the customer numbers to keep for table
selected_customers <- c("501261447", "501563326", "501580838", "501686677", "501483916", "600563579",
                        "501004189", "501004936", "501006776", "501006936", "501007325", "501007477", "501007757")

# Summarizing the fulfillment rates and filtering the selected customers for table
filtered_fulfillment_rates <- swire_data_full %>%
  group_by(customer_number) %>%
  summarise(
    total_ordered_cases = sum(ordered_cases, na.rm = TRUE),
    total_delivered_cases = sum(delivered_cases, na.rm = TRUE),
    fulfillment_rate = ifelse(total_ordered_cases == 0, NA, total_delivered_cases / total_ordered_cases)
  ) %>%
  filter(customer_number %in% selected_customers) %>%  # Keep only the selected customers
  arrange(desc(fulfillment_rate))

# Display the filtered table
gt(filtered_fulfillment_rates) |>
  tab_header(
    title = "Fulfillment Rates for Selected Customers"
  )
Fulfillment Rates for Selected Customers
customer_number total_ordered_cases total_delivered_cases fulfillment_rate
501261447 3.0 18.0 6.0000000
501563326 1.0 6.0 6.0000000
501580838 2.0 12.0 6.0000000
501686677 1.0 6.0 6.0000000
600563579 159.0 915.0 5.7547170
501483916 43.0 221.5 5.1511628
501004189 50.0 50.0 1.0000000
501004936 10.5 10.5 1.0000000
501006776 801.0 801.0 1.0000000
501006936 52.0 52.0 1.0000000
501007325 20.0 20.0 1.0000000
501007757 76.5 76.5 1.0000000
501007477 113.0 109.0 0.9646018
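One way to follow up on the over-fulfilled customers in the table is to check whether their returns are also elevated. The sketch below is one hedged approach, assuming the `swire_data_full` columns shown earlier (`return_frequency` is treated here as an indicator of a return on the order):

```r
# Sketch: do customers delivered more than they ordered also return more?
# Assumes swire_data_full with ordered_cases, delivered_cases, return_frequency.
library(dplyr)

return_check <- swire_data_full %>%
  group_by(customer_number) %>%
  summarise(
    total_ordered_cases   = sum(ordered_cases, na.rm = TRUE),
    total_delivered_cases = sum(delivered_cases, na.rm = TRUE),
    return_rate           = mean(return_frequency, na.rm = TRUE),  # share of orders with a return
    .groups = "drop"
  ) %>%
  mutate(
    fulfillment_rate = ifelse(total_ordered_cases == 0, NA,
                              total_delivered_cases / total_ordered_cases),
    over_fulfilled   = !is.na(fulfillment_rate) & fulfillment_rate > 1
  )

# Compare average return rates between the two groups
return_check %>%
  group_by(over_fulfilled) %>%
  summarise(avg_return_rate = mean(return_rate, na.rm = TRUE), n = n())
```

If over-fulfilled customers show a similar return rate to everyone else, the extra deliveries more plausibly reflect genuine demand rather than correction of errors.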

The plot below indicates that the Travel, Superstore, and Bulk Trade channels have the highest average total ordered volume, followed by General Activities and Academic Institutions. Other trade channels, including Healthcare, Recreation, and Defense, have lower average total ordered values. This comparison highlights that certain channels consistently order far more on average than others.

# Aggregate data to calculate the average total ordered per trade channel
top_trade_channels <- swire_data_full %>%
  group_by(trade_channel) %>%
  summarise(AVG_TOTAL_ORDERED = mean(ordered_cases + ordered_gallons, na.rm = TRUE)) %>%
  arrange(desc(AVG_TOTAL_ORDERED)) %>%
  slice_head(n = 20)  # Select top 20 trade channels

# Create bar plot
ggplot(top_trade_channels, aes(x = reorder(trade_channel, -AVG_TOTAL_ORDERED), y = AVG_TOTAL_ORDERED, fill = trade_channel)) +
  geom_bar(stat = "identity", alpha = 0.7, show.legend = FALSE) +  # Bar plot without legend
  labs(
    title = "Top 20 Trade Channels by Average Total Ordered",
    x = "Trade Channel",
    y = "Average Total Ordered"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))  # Rotate x-axis labels for readability

The graph below illustrates that Cruise, Online Store, and Recreation Park have the highest average total ordered volumes. Additionally, Bulk Trade, Other Travel, and Comprehensive Provider exhibit notable order volumes, further reinforcing their contribution to overall sales.

On the other hand, other sub-trade channels, such as Game Center, Recreation Film, and Fast Food, show comparatively lower average total ordered values. This comparison highlights the significant variation in demand across sub-trade channels, suggesting that order volume trends are strongly influenced by industry-specific purchasing behaviors.

# Aggregate data to calculate the average total ordered per sub-trade channel
top_sub_trade_channels <- swire_data_full %>% 
  group_by(sub_trade_channel) %>% 
  summarise(AVG_TOTAL_ORDERED = mean(ordered_cases + ordered_gallons, na.rm = TRUE)) %>% 
  arrange(desc(AVG_TOTAL_ORDERED)) %>% 
  slice_head(n = 20)  # Select top 20 sub-trade channels

# Create bar plot
ggplot(top_sub_trade_channels, aes(x = reorder(sub_trade_channel, -AVG_TOTAL_ORDERED), y = AVG_TOTAL_ORDERED, fill = sub_trade_channel)) + 
  geom_bar(stat = "identity", alpha = 0.7, show.legend = FALSE) +  # Bar plot without legend
  labs(
    title = "Top 20 Sub Trade Channels by Average Total Ordered",
    x = "Sub Trade Channel",
    y = "Average Total Ordered"
  ) + 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))  # Rotate x-axis labels for readability

Order Type

The graph below indicates that higher order volumes primarily come from MyCoke Legacy. This trend could be attributed to the fact that most sales originate from MyCoke Legacy and Sales Rep orders, or it may suggest a potential link between order type and high-growth potential customers.

When analyzing customers with total volume below 400, Call Center is the most common order type, while Sales Rep orders drop to third place. This shift suggests that order type could be a valuable characteristic for identifying high-growth potential customers, particularly among lower-volume segments. Further analysis could help determine whether order type patterns correlate with customer expansion trends.

# Aggregate order data by order type and plot total ordered cases
order_type_summary <- swire_data_full %>%
  group_by(order_type) %>%
  summarize(total_cases = sum(ordered_cases, na.rm = TRUE))

ggplot(order_type_summary, aes(x = reorder(order_type, -total_cases), y = total_cases, fill = order_type)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Total Ordered Cases by Order Type", x = "Order Type", y = "Total Cases")

# Plot total ordered volume by order type for customers with < 400 total units per year
swire_data_full %>%
  group_by(customer_number, order_type) %>%
  summarize(total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE)) %>%
  filter(total_gallons < 400) %>% 
  group_by(order_type) %>% 
  summarize(gallons = sum(total_gallons)) %>%
  ggplot(aes(x = reorder(order_type, -gallons), y = gallons, fill = order_type)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +  
  labs(title = "Ordered Volume by Order Type (Customers < 400)", x = "Order Type", y = "Total Volume")
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.

Cold Drink Channel

The graphs below indicate that Dining and Goods are the primary cold drink channels for customers with a volume of less than 400 gallons, whereas Bulk Trade and Dining dominate across all customers. This distinction suggests that conducting a separate ‘Dining’ cold drink channel analysis could be valuable in comparing the characteristics and ordering behaviors between customers below and above the 400-gallon threshold.

# Plot total gallons ordered by channel
swire_data_full %>%
  group_by(cold_drink_channel) %>%
  summarize(
    gallons = sum(ordered_gallons + ordered_cases)  # Combine gallons and cases into a total volume
  ) %>%
  ggplot(aes(
    x = reorder(cold_drink_channel, -gallons),  # Reorder bars by descending volume
    y = gallons, 
    fill = cold_drink_channel  # Use color to differentiate channels
  )) +
  geom_bar(stat = "identity") +  # Use raw values (not counts) for bar height
  theme_minimal() +  # Apply clean minimal theme
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)  # Tilt x-axis labels for readability
  ) +
  ggtitle("Barplot of Ordered Gallons by Cold Drink Channel") +
  labs(x = "Cold Drink Channel", y = "Total Gallons")

# Plot total gallons for customers with fewer than 400 gallons ordered
swire_data_full %>%
  group_by(customer_number, cold_drink_channel) %>%
  summarize(
    total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE)  # Sum volume per customer/channel
  ) %>%
  filter(total_gallons < 400) %>%  # Filter to small-volume customers
  group_by(cold_drink_channel) %>%
  summarize(
    gallons = sum(total_gallons)  # Aggregate across small-volume customers
  ) %>%
  ggplot(aes(
    reorder(cold_drink_channel, -gallons),  # Reorder bars by descending volume
    y = gallons, 
    fill = cold_drink_channel
  )) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  ggtitle("Ordered Gallons by Cold Drink Channel (Customers < 400 Gallons)") +
  labs(x = "Cold Drink Channel", y = "Total Gallons")
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.

Local Market Partners

The table below presents order volume and unit delta from 2023 to 2024 by customer, segmented by trade channel and sub-trade channel. Among the sub-trade channels, Cruises, Online Stores, Bulk Trade, and Comprehensive Providers exhibit higher unit delta year-over-year, indicating strong growth trends.

Additionally, the Fast Casual Dining trade channel stands out with notable year-over-year increases in order volume. This suggests that Fast Casual Dining customers within the local market partner segment could represent less obvious but high-growth potential opportunities, making them a strategic group to explore further.

# Year over year changes in gallon volume by customer
swire_data_full %>%
  filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
  group_by(customer_number, year, trade_channel, sub_trade_channel) %>%
  summarize(gallons = sum(ordered_gallons + ordered_cases)) %>%
  pivot_wider(names_from = year, values_from = gallons) %>%
  mutate(unit_delta = (`2024` - `2023`)) %>%
  arrange(desc(unit_delta))
`summarise()` has grouped output by 'customer_number', 'year', 'trade_channel'.
You can override using the `.groups` argument.
# A tibble: 16,746 × 6
# Groups:   customer_number, trade_channel [16,746]
   customer_number trade_channel      sub_trade_channel `2023` `2024` unit_delta
   <chr>           <fct>              <fct>              <dbl>  <dbl>      <dbl>
 1 500733119       TRAVEL             CRUISE            139520 1.69e5     29102 
 2 600064039       TRAVEL             CRUISE            160752 1.88e5     27734.
 3 600573998       TRAVEL             CRUISE            376588 4.03e5     26077 
 4 500405618       BULK TRADE         BULK TRADE           588 2.60e4     25422 
 5 501031108       SUPERSTORE         ONLINE STORE       38909 6.12e4     22288.
 6 600261951       TRAVEL             CRUISE            106644 1.27e5     20502 
 7 600585734       TRAVEL             CRUISE             84808 1.03e5     17747 
 8 600576028       SUPERSTORE         ONLINE STORE       29744 4.65e4     16738.
 9 501202275       PROFESSIONAL SERV… OTHER PROFESSION…    216 1.67e4     16494 
10 501562278       GENERAL            COMPREHENSIVE PR…   5858 2.22e4     16343 
# ℹ 16,736 more rows
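Ranking by raw unit delta favors the largest customers. A hedged alternative, sketched below on the same pipeline, is to rank by percent growth instead; the minimum-base filter of 100 units is our own arbitrary choice to keep tiny 2023 bases from inflating percentages:

```r
# Sketch: year-over-year change expressed as percent growth rather than
# absolute units, so large customers don't dominate the ranking.
library(tidyverse)

swire_data_full %>%
  filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
  group_by(customer_number, year, trade_channel, sub_trade_channel) %>%
  summarise(gallons = sum(ordered_gallons + ordered_cases), .groups = "drop") %>%
  pivot_wider(names_from = year, values_from = gallons) %>%
  filter(!is.na(`2023`), !is.na(`2024`), `2023` >= 100) %>%  # require a minimum 2023 base
  mutate(pct_growth = (`2024` - `2023`) / `2023`) %>%
  arrange(desc(pct_growth))
```

Customers high on this list but absent from the unit-delta ranking are small accounts growing quickly, which is exactly the "less obvious" high-growth profile we are after.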

The graphs below illustrate the distribution of order types among local market partners. Notably, the EDI order type is the most dominant among local market partners. However, among customers with a volume of less than 400 gallons, Call Center and MyCoke360/Legacy orders remain the preferred choices.

While this difference may not seem highly significant, the shift in order type preference for customers with < 400 gallons—both in the local market partner segment and across all customers—could indicate that this variable holds some importance in customer behavior.

# For all local market partner customers
swire_data_full %>%
  filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
  group_by(order_type) %>%
  summarize(
    gallons = sum(ordered_gallons + ordered_cases)  # Combine gallons and cases into total volume
  ) %>%
  ggplot(aes(
    x = reorder(order_type, -gallons),  # Reorder bars by descending volume
    y = gallons, 
    fill = order_type
  )) +
  geom_bar(stat = "identity") +  # Use actual values for bar height
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)  # Tilt x-axis labels for readability
  ) +  
  labs(
    title = "Ordered Volume by Order Type (Local Market Partners)", 
    x = "Order Type", 
    y = "Total Volume"
  )

# For customers with < 400 gallons per year
swire_data_full %>%
  filter(local_market_partner == TRUE & co2_customer == FALSE) %>%
  group_by(customer_number, order_type) %>%
  summarize(
    total_gallons = sum(ordered_gallons + ordered_cases, na.rm = TRUE)  # Sum volume by customer/order type
  ) %>%
  filter(total_gallons < 400) %>%
  group_by(order_type) %>%
  summarize(
    gallons = sum(total_gallons)  # Aggregate total gallons for small-volume customers
  ) %>%
  ggplot(aes(
    x = reorder(order_type, -gallons), 
    y = gallons, 
    fill = order_type
  )) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +  
  labs(
    title = "Ordered Volume by Order Type (Customers < 400)", 
    x = "Order Type", 
    y = "Total Volume"
  )
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.

The visuals below indicate that non-partners generally exhibit higher median total orders and a wider range of order volumes, with some extreme outliers. In contrast, local market partners tend to have lower total orders and a smaller interquartile range, suggesting less variability in their order quantities.

This difference in distribution may reflect distinct ordering behaviors between the two groups, warranting further analysis to understand the factors influencing order volume consistency among local market partners.

# Define threshold for filtering (optional)
NUMBER <- 1000  # Exclude extreme values if needed

# Filter dataset to exclude outliers (optional)
ORDERED_UNDER_NUMBER <- swire_data_full %>% 
  filter(ordered_gallons + ordered_cases <= NUMBER)

# Create boxplot to compare TOTAL_ORDERED by LOCAL_MARKET_PARTNER status
ggplot(ORDERED_UNDER_NUMBER, aes(x = local_market_partner, y = ordered_gallons + ordered_cases, fill = local_market_partner)) + 
  geom_boxplot(alpha = 0.6) +  # Boxplot with transparency
  labs(
    title = paste("LOCAL_MARKET_PARTNER vs. TOTAL_ORDERED (Excluding Orders >", NUMBER, ")"),
    x = "Local Market Partner Status",
    y = "Total Ordered"
  ) + 
  theme_minimal()

# Compute average TOTAL_ORDERED for Local Market Partners (TRUE) and Non-Partners (FALSE)
avg_total_order <- swire_data_full %>% 
  group_by(local_market_partner) %>% 
  summarise(AVG_TOTAL_ORDERED = mean(ordered_gallons + ordered_cases, na.rm = TRUE))

# Display result
print(avg_total_order)
# A tibble: 2 × 2
  local_market_partner AVG_TOTAL_ORDERED
  <lgl>                            <dbl>
1 FALSE                             57.6
2 TRUE                              32.2
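Given how skewed these volume distributions are, a mean comparison alone can mislead. One hedged follow-up, sketched below, is a non-parametric Wilcoxon rank-sum test on the same per-row volumes; `partner_vol` is our own intermediate name, not from the pipeline above:

```r
# Sketch: non-parametric test of whether order volume differs between
# local market partners and non-partners (robust to the heavy skew).
library(dplyr)

partner_vol <- swire_data_full %>%
  mutate(total_ordered = ordered_gallons + ordered_cases)

# Rank-sum test across the two partner groups
wilcox.test(total_ordered ~ local_market_partner, data = partner_vol)
```

A small p-value here would support the visual impression that the two groups order differently, without assuming normality.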

CO2 Customer

CO2 customers are an interesting subset of the data. We can look at customers who purchase fountain products (i.e., “ordered_gallons”) and compare their distribution across CO2 customer status, even adjusting for local market partner status.

swire_data_full |>
  group_by(co2_customer, customer_number, local_market_partner) |>
  summarise(
    total = sum(ordered_gallons)  # Sum total gallons per customer
  ) |>
  ungroup() |>
  filter(total > 0) |>  # Exclude customers with zero volume

  ggplot(
    aes(x = local_market_partner, total)  # Compare total volume by LMP status
  ) +
    geom_boxplot(color = swire_colors$red) +  # Red boxplots
    scale_y_log10() +  # Log scale to account for skewed volume distribution
    facet_wrap(~co2_customer) +  # Separate plots for CO2 vs non-CO2 customers
    labs(
      title = "Distribution of Total Volume Among 'Fountain' Products", 
      subtitle = "Grouped by status as a CO2 customer", 
      x = "Local Market Partner?"
    ) +
    theme_swire() +
    theme(
      axis.title.y = element_blank()  # Omit y-axis title for cleaner look
    )
`summarise()` has grouped output by 'co2_customer', 'customer_number'. You can
override using the `.groups` argument.

Among those ordering fountain products, we really aren’t seeing any measurable difference. There are subtleties in the outliers, but non-CO2 customers are not meaningfully different in order volume. Even adjusting for local_market_partner doesn’t reveal any meaningful pattern.

Customer Tenure

An interesting component of this analysis is the varying levels of tenure: some customers are “new” (their first order falls within the time window) while others have been customers for years. We can look at order volume by tenure (years) for any patterns.

swire_data_full |>
  group_by(customer_number) |>
  summarise(
    tenure_years = floor(max(as.integer(transaction_date - on_boarding_date) / 365)),  # Calculate tenure in full years
    total = sum(ordered_total)  # Total volume per customer
  ) |>
  ungroup() |>
  mutate(
    tenure_years = as.factor(tenure_years)  # Convert tenure to factor for discrete x-axis
  ) |>

  ggplot(
    aes(tenure_years, total)  # Compare volume across tenure groups
  ) +
    geom_boxplot(color = swire_colors$red) +  # Red boxplots
    scale_x_discrete() +  # Ensure tenure groups appear as categories
    scale_y_log10() +  # Log scale for better visibility of spread
    labs(
      title = "Distribution of Total Volume", 
      subtitle = "Organized by customer tenure (years)"
    ) +
    theme_swire()
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Warning: Removed 137 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Predictably, customers with less tenure than our two (2) year window will show lower total volume than others. There also appear to be many outliers during the initial years of tenure. Later on, however, there’s more variability in the IQR. We can expand this even further by introducing another variable and simplifying to tenure bins.

swire_data_full |>
  group_by(customer_number, order_type) |>
  summarise(
    tenure_years = floor(max(as.integer(transaction_date - on_boarding_date) / 365)),  # Calculate full years of tenure
    total = sum(ordered_total)  # Total volume per customer/order type
  ) |>
  ungroup() |>
  mutate(
    tenure_years_bin = as.factor(cut(tenure_years, breaks = 4))  # Bin tenure into 4 equal-width ranges and convert to factor
  ) |>

  ggplot(
    aes(tenure_years_bin, total)  # Plot volume by tenure bin
  ) +
    geom_boxplot(color = swire_colors$red) +  # Red boxplots
    scale_y_log10() +  # Log scale to account for skew
    facet_wrap(~order_type) +  # Separate plots by order type
    labs(
      title = "Distribution of Total Volume", 
      subtitle = "Organized by tenure and order type"
    ) +
    theme_swire()
`summarise()` has grouped output by 'customer_number'. You can override using
the `.groups` argument.
Warning in scale_y_log10(): log-10 transformation introduced infinite values.
Warning: Removed 2422 rows containing non-finite outside the scale range
(`stat_boxplot()`).

We see here that tenure really doesn’t factor into the equation. Looking at the same breakdown for cold_drink_channel, the same holds true. This suggests there’s no general trend of tenure driving increased order volume, even down to sub-groups.

Geographic Order Volume

We can also get a sense of the concentration of orders by geography. We’ll build on the map we created above, but scale the size of the points to the volume of orders.

swire_cust_vol <- 
  swire_data_full |>
  group_by(customer_number, lon, lat, local_market_partner) |>
  summarise(
    total = sum(ordered_total)  # Total ordered volume per customer
  ) |>
  ungroup() |>
  mutate(
    total_scaled = (total / max(total)) * 100  # Scale volume to 0–100 range for visualization
  )
`summarise()` has grouped output by 'customer_number', 'lon', 'lat'. You can
override using the `.groups` argument.
swire_map_vol <-                                                              
    leaflet(height = 800, width = 800) |>  # Initialize leaflet map with set dimensions
    addTiles() |>  # Add default OpenStreetMap base tiles
    setView(
        lng = mean(swire_cust_vol$lon),  # Center longitude at the mean customer location
        lat = mean(swire_cust_vol$lat),  # Center latitude at the mean customer location
        zoom = 5  # Set zoom level for a broad regional view
    ) |>
    addProviderTiles("CartoDB.Positron") |>  # Use a clean, minimal basemap
    addCircleMarkers(
        data = swire_cust_vol,
        lng = ~lon,  # Longitude from dataset
        lat = ~lat,  # Latitude from dataset
        radius = ~sqrt(total_scaled),  # Use square root to visually balance the scaled volume
        color = ~swire_colors$red,  # Use custom red from palette
        fillOpacity = 0.15  # Low opacity to reduce visual clutter when markers overlap
    )

swire_map_vol

There are certainly hotbeds of activity, but no strong pattern emerges from this view. Better still would be quantifying density relative to other delivery points and/or sister franchises. We could use something like the haversine formula as a proxy.
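The haversine idea can be sketched directly; the function name below is our own, not from a package, and the 25 km neighborhood radius is an arbitrary illustrative choice:

```r
# Sketch: haversine great-circle distance (km) between lon/lat points,
# usable as a proxy for delivery-point density. Vectorized over its inputs.
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))  # clamp for floating-point safety
}

# Example: count neighbors within 25 km of each customer (O(n^2) pairwise
# scan; fine for a few thousand points, else use a spatial index).
# density <- sapply(seq_len(nrow(swire_cust_vol)), function(i) {
#   d <- haversine_km(swire_cust_vol$lon[i], swire_cust_vol$lat[i],
#                     swire_cust_vol$lon,    swire_cust_vol$lat)
#   sum(d <= 25) - 1  # exclude the customer itself
# })
```

A per-customer density score like this could then be tested as a feature when distinguishing Red Truck from White Truck candidates.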

Conclusion

This exploratory data analysis (EDA) exercise has established a foundation for our group in some key areas:

  1. We are far more familiar with Swire Coca-Cola’s business model, data structure, and customer nuances. This will equip us with necessary context for the modeling stage.

  2. We have a solid understanding of areas to model and explore more deeply given the lessons learned and findings extracted. This means we’ll be more efficient and effective.

  3. Our team has yielded “proof of concept” in how we collaborate cognitively (i.e. exploration of ideas) and structurally. This is crucial as the group looks to deliver a powerful solution to Swire Coca-Cola by project end.

During the next phase of the project, Modeling, our team will build on these insights with more advanced analysis techniques, such as clustering algorithms, cost-savings models, and both temporal and spatial analyses. This is all in the effort to develop a framework for better distinguishing high-growth from low-growth customers, one that improves on the current 400-gallon threshold.

Division of Work

Formulation of this group EDA was made possible thanks to the collaborative partnership from all group members.

As a group, we’ve partnered in the following ways:

  • Worked out of a group GitHub repository, following best practices of commits to branches and issuing pull requests, with teammates reviewing, approving, and merging work to main.

  • Partnered on data wrangling logic, determining how we would clean, standardize, and resolve inherent constraints of the provided data. This was the recipe for individual work described below.

  • Consolidated individual work into this document, ensuring thoughtful organization, engaging content, and attractive design.

Each member contributed with individual work as described below:

  • Individually, we each created a unique EDA document comprising our own insights from extensive research and time with the data sets.

  • Individually, we created a cleaning script adhering to the defined plan as described above. This approach helped ensure all gained valuable programming experience and were intimately familiar with our cleaned data.

Thank you,

Adam, Georgia, Tyler, Zac